Modelling interaction in the lexicon

  • The big picture: human languages evolve on a cultural timescale:
    • individual utterance selection > language change > language evolution
  • Massive centuries-spanning corpora compiled in recent years open up unprecedented avenues for investigating language dynamics.
  • (cf. Bochkarev et al., 2014; Cuskley et al., 2014; Feltgen et al., 2017; Frermann and Lapata, 2016; Gulordava and Baroni, 2011; Hamilton et al., 2016; Newberry et al., 2017; Petersen et al., 2012; Sagi et al., 2011; Schlechtweg et al., 2017; Wijaya and Yeniterzi, 2011)
  • These allow tracking not only word usage frequencies but also word meaning, using distributional semantics methods

  • What I’m interested in: as new words - e.g. neologisms & borrowings - are selected for, what happens to their older synonyms?
  • Identified two confounds that need to be controlled for
  • Analyses based on simply counting words can yield spurious results
    • a big change may well be driven by a change in topic composition (1)
  • Automatic distribution-based similarity measures are useful for quantifying both meaning and meaning change
    • but apparent semantics tend to change when frequency changes (2)

1. Fluctuations in topic frequencies

  • Observation: the ebb and flow of discourse topics in a diachronic corpus reflects real-world events (wars -> war-related news -> frequency of military words increases)
  • Token frequency ~ probability of usage ~ fitness ~ being selected for
  • However: corpus frequencies may be misleading (Chesley & Baayen, 2010; Lijffijt et al., 2012; Calude et al., 2017; Szmrecsanyi, 2016)
  • Observation: sometimes similar words both increase in frequency instead of competing; the emergence of a new word often coincides with a frequency increase, not a decrease, in similar words
  • Frequency change might not necessarily imply selection.


The topical-cultural advection model

  • Control for diachronic topical fluctuations by quantifying the frequency change of a word’s topic.
  • advection: ‘the transport of substance, particularly fluids, by bulk motion’
  • Formalized as the weighted mean of the log frequency changes of the relevant topic (context) words of the target word

How does this work?

  • Generate a “topic” for each target word, consisting of m context words, based on co-occurrence

  • Topical advection: a measure of how much a target word’s topic/context words (e.g. mocha for latte) have changed in frequency on average (weighted by some association score) between two periods.
  • latte: calculate its log frequency change (e.g. +1.19 between 1990s->2000s)
  • calculate its topical advection: +0.07 (weighted mean log frequency change in context words) (see Appendix for math)
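The advection computation can be illustrated with a toy example (the context words, frequencies, and association weights below are invented for illustration; they are not the COHA values behind the +0.07 figure):

```python
import math

# Hypothetical context words for a target like "latte", with made-up
# frequencies in two decades and made-up association weights.
context = {
    # word: (freq_1990s, freq_2000s, weight)
    "coffee":   (9000, 11000, 3.0),
    "espresso": (400,  700,   2.5),
    "milk":     (20000, 21000, 1.0),
}

def log_change(f_prev, f_curr):
    # log(f + 1) smoothing avoids log(0) for unattested words
    return math.log(f_curr + 1) - math.log(f_prev + 1)

changes = [log_change(f0, f1) for f0, f1, _ in context.values()]
weights = [w for _, _, w in context.values()]

# Weighted mean of the context words' log frequency changes
advection = sum(c * w for c, w in zip(changes, weights)) / sum(weights)
print(round(advection, 3))  # -> 0.315
```

A positive value here would indicate that the target word's topic as a whole is rising in frequency.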

How well does it work?

  • Correlate the log frequency changes of all (sufficiently frequent) nouns between two time periods to their respective topical advection values
  • What should we expect?


Interim conclusions

2. Frequency change bias in semantic change measures

  • The two confounds that need to be controlled for…
    • Topical fluctuations ✔
    • Interplay of frequency change and semantic change measures
  • Observation: frequency change in a word appears to affect its (distributional) semantics.
  • Distributional semantics ~ topic modelling, all based on contextual co-occurrence; semantic change ~ semantic (self-)similarity of a word between temporal subcorpora
  • If the frequency difference of a word between two time periods affects its semantics, this would be a problem for semantic change measures
    (cf. Dubossarsky et al. 2017 for more critique of automated semantic change measures)

Testing approach

  • Simulate frequency change of a word between subcorpora and measure the resulting semantic change
  • But instead of actual different subcorpora, use data from one single corpus (2000-2009 in COHA), and generate different versions of it (corpus\('\)) where the occurrences of some target word \(w\) have been downsampled by relabelling a fixed portion of them as \(w'\)
  • Measure the similarity of \(w\) in the original corpus to \(w'\) in corpus\('\).
  • Null hypothesis: no semantic change should occur (actually the same word)

  • 100 random words (nouns) from equally spaced log frequency bands, 25 downsample sizes \(s \in [0.1, 7]\)
  • For each \(w\) with frequency \(f\), and each \(s\), relabel a portion \(e^{\ln(f) - s} = f/e^s\) (excl. downsamples \(n<10\))
  • E.g., if \(f=1000\), \(s=0.7\), then \(1000/e^{0.7} \approx 496\), or a -50.3% reduction.
  • For each downsampled \(w'\), measure its semantic similarity to the original word, using 5 different distributional approaches (with 10x replications for each combination):
    • full count vectors (no dimension reduction), cosine similarity
    • full vectors, but PPMI weighted, cosine similarity
    • APSyn rank-based similarity, using top 100 PPMI-weighted terms (Santus et al. 2016)
    • Latent Semantic Analysis (SVD) embeddings of count vectors, cosine similarity
    • GloVe embeddings of count vectors, cosine similarity (Pennington et al. 2014)




Conclusions (part 2)

  • All 5 semantic similarity methods exhibit the bias, but the extent is variable; the bias is more predictable by frequency band in some methods, less in others
  • Vector space density matters: a large change value does not necessarily correspond to a categorical change in semantics in a sparse space; but similarity rank between \(w\) and \(w'\) is comparable between methods
  • Good news: change to the extent of becoming a “different word” (\(w\) not the closest synonym for \(w'\)) occurs mostly at low frequencies (<100), which should be considered unreliable anyway
  • Some methods (APSyn, GloVe) are more susceptible, while the very simple method of measuring the cosine over an unreduced PPMI-weighted vector space performs best.
  • The downsampling approach is extendable to actual diachronic corpora, to compare observed semantic change against the expected change stemming from frequency difference alone.

Future work

  • As new words are selected for, what happens to their older synonyms?
  • Observation: competition may manifest in at least two ways:
    • the losing variant decreases in usage frequency
    • or it changes meaning, while the form remains in use (e.g. radio <-> wireless, beef <-> cow)
    • …but at times near-synonyms both successfully remain in use
  • Hypothesis: high semantic similarity (introduced by emergent novel words or semantic change) leads to competition between similar variants1 - unless there is sufficient communicative need2 in the lexical subspace to sustain near-synonymy.
    • 1 apparent by diverging frequency or diverging semantics, but the semantic change must be controlled for bias
      2 as measured by the advection model









Appendix


Math (the topical-cultural advection model)

The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in the frequencies, compared to the previous period, of the words associated with it. More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is

\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,...m \}, \, W \big) \end{equation}\]

where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply

\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum x_i w_i }{\sum w_i} \end{equation}\]

where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,

\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]

where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.

Parameters

  • Used the COHA corpus, divided into decade subcorpora
  • Preprocessing: lemmatization and stopword removal, and misc cleaning; used only content words, and excluded proper nouns; the advection model was applied to common nouns only.
  • Excluded words with fewer than 100 occurrences
  • Used the top 100 PPMI-weighted context words (from a window of +/-5) for the simpler approach; an LDA model yielded comparable results (see full paper for more details)
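The windowed co-occurrence counting underlying these parameters can be sketched as follows (a simplification: this uses a plain symmetric window on a token list, omitting the PPMI weighting and the linear distance weighting used in the simulation section):

```python
from collections import Counter

def cooccurrence(tokens, window=5):
    # Count co-occurrences within a symmetric +/-5 token window,
    # as used to pick each target word's context words.
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

toks = "the barista made a latte with espresso and milk".split()
c = cooccurrence(toks)
print(c[("latte", "espresso")])  # -> 1
```

In the actual pipeline these counts would be computed per decade subcorpus over lemmatized, stopword-filtered text.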

Math and parameters (the frequency change - semantic change simulation)

  • Used the same data (COHA) as in the previous section, but limited to the last decade (2000-2009), which is ~10m words after cleaning
  • Context window for co-occurrence is +/-5, linearly weighted
  • The methods:
    • cosine similarity over plain co-occurrence count matrix (no dimension reduction), i.e. vector length is the entire lexicon (~10k)
    • cosine over PPMI-weighted full vectors (same, full lexicon)
    • APSyn (N=100 top PPMI-weighted terms are used for the rank comparison)
    • cosine over LSA-reduced co-occurrence matrix (300dim)
    • cosine over GloVe-reduced co-occurrence matrix (100dim, 20 iterations with early stopping allowed, learning rate = 0.15, alpha = 0.75, lambda = 0; parameter tuning and longer training might improve results)
  • The downsampling:
  • Sampled 100 nouns from equally spaced log frequency bands, with frequencies in \([510, 50863]\)
  • Defined a sequence of 25 downsample sizes \(s \in [0.1, 7]\); the results also include a “sanity check” of \(s=0\), where no reduction is applied, only random reshuffling

For each \(w\) with an original frequency \(f\), and each \(s\), downsampled by randomly relabeling a fixed portion of its occurrences as \(w'\) in the corpus, where the portion is defined as \(e^{\ln(f) - s} = f/e^s\) (excluded downsamples with \(<10\) occurrences)
E.g., if \(f=1000\), \(s=0.7\), then \(e^{\ln(1000) - 0.7} \approx 496\), or a -50.3% reduction.
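The relabelling step can be sketched like this (a simplification: real occurrences sit at positions throughout the corpus, not at the 0..f-1 indices used here):

```python
import math
import random

def downsample(f, s, rng):
    # Number of occurrences to relabel as w': e**(ln(f) - s) == f / e**s
    n = int(f / math.exp(s))
    if n < 10:  # downsamples with fewer than 10 occurrences were excluded
        return None
    # pick which of the f token occurrences get relabelled as w'
    return rng.sample(range(f), n)

rng = random.Random(42)
relabelled = downsample(1000, 0.7, rng)
print(len(relabelled))  # int(1000 / e**0.7) -> 496
```

Repeating this for each frequency band and each \(s\), with fresh random draws, yields the 10x replications per combination mentioned above.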




*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.